Categorize internal failures with a static failure category#4972
Conversation
Greptile SummaryThis PR stamps Armada-generated job-run failures with a static
Confidence Score: 5/5Safe to merge; the change is purely additive, populating existing proto fields that were previously left empty, with no schema or migration required. All seven error-construction sites are correctly updated, the classifier bypass is gated on a well-tested helper function, and both executor and scheduler test suites have new assertions verifying the stamps end-to-end. The only previously-uncovered gap (the non-exhaustive switch) was already flagged in an earlier review thread. No files require special attention; the logic in pod_issue_handler.go is the most complex change but is well-covered by the renamed structural-issue test and the per-type unit test for internalSubcategoryForPodIssueType. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Pod issue detected] --> B{handleNonRetryableJobIssue}
B --> C{internalSubcategoryForPodIssueType}
C -->|StuckTerminating| D["internal / stuck-terminating"]
C -->|ExternallyDeleted| E["internal / externally-deleted"]
C -->|ErrorDuringIssueHandling| F["internal / issue-handler-error"]
C -->|ActiveDeadlineExceeded| G["internal / active-deadline"]
C -->|StuckStartingUp / UnableToSchedule / FailedStartingUp / default| H[Operator Classifier]
H --> I["operator category / subcategory"]
J[Scheduler events] --> K{Error type}
K -->|LeaseExpired| L["internal / lease-expired"]
K -->|MaxRunsExceeded| M["internal / max-runs-exceeded"]
K -->|JobRejected| N["internal / job-rejected"]
O[Executor: job creation failed] --> P["internal / job-creation-failed"]
O2[Reconciliation: pod missing] --> P2["internal / pod-missing"]
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
A[Pod issue detected] --> B{handleNonRetryableJobIssue}
B --> C{internalSubcategoryForPodIssueType}
C -->|StuckTerminating| D["internal / stuck-terminating"]
C -->|ExternallyDeleted| E["internal / externally-deleted"]
C -->|ErrorDuringIssueHandling| F["internal / issue-handler-error"]
C -->|ActiveDeadlineExceeded| G["internal / active-deadline"]
C -->|StuckStartingUp / UnableToSchedule / FailedStartingUp / default| H[Operator Classifier]
H --> I["operator category / subcategory"]
J[Scheduler events] --> K{Error type}
K -->|LeaseExpired| L["internal / lease-expired"]
K -->|MaxRunsExceeded| M["internal / max-runs-exceeded"]
K -->|JobRejected| N["internal / job-rejected"]
O[Executor: job creation failed] --> P["internal / job-creation-failed"]
O2[Reconciliation: pod missing] --> P2["internal / pod-missing"]
Reviews (19): Last reviewed commit: "Merge branch 'master' into categorize-in..." | Re-trigger Greptile |
0af53eb to
4cbb8a5
Compare
c4ab17f to
4129b79
Compare
4129b79 to
9d28e68
Compare
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
9d28e68 to
72c5d1d
Compare
|
Tick the box to add this pull request to the merge queue (same as
|
Armada-generated job-run failures previously carried no
failure_category/failure_subcategory. Only operator-classified pod failures (via the executor categorizer) were tagged, so any dashboard attributing "why did this run end" had a large unlabeled remainder for everything Armada itself decided.This change stamps the failures Armada itself authors with a static, code-owned category
internalplus a subcategory naming the cause. The boundary is the source of the error.internalcovers errors Armada produces from its own logic, where it writes a fixed, self-authored message. Errors where Armada merely relays dynamic external content (the kubelet, the K8s API, the scheduler) are left to the operator categorizer, which attributes them into the operator's own categories (infra, user_error, and so on) using its configured default when no rule matches. An operator never has to match Armada's internal error strings, and the top-level value reads as a triage signal:internalmeans the failure was Armada's own machinery, not the workload.Stamped
internal:For the structural pod issues the categorization is deterministic: the categorizer is not consulted, so a configured rule cannot override
internal. The only fallback anywhere is the categorizer's own defaultCategory, which applies to the external causes below.Not stamped
internal:The subcategory vocabulary lives in
internal/common/errormatch(a dependency-free leaf already holding the sibling condition constants), so both the executor and the scheduler stamp from one shared set without an import cycle.No proto change, no migration. The change populates the existing
Error.FailureCategory/Error.FailureSubcategoryproto fields. One observable metric effect: the executor failure counterarmada_executor_job_failure_category_totalnow reportsinternalfor the structural pod issues, which previously emitted no category (the counter is a no-op on an empty category). This is intentional.Where the stamps are visible today:
JobRunErrorsand reach the Lookoutjob_run.failure_category/failure_subcategorycolumns.JobErrors(notJobRunErrors), so they do not reach the Lookoutjob_runcolumns. The api event conversion only copies these fields on thePodErrorarm, so they do not reach the api stream either. Their stamps are not observable anywhere today. They are set for construction-site consistency and to be ready whenJobErrorspersistence or the api conversion is extended.Follow-up (separate PR): an
onPodIssuestructured matcher so operators can categorize the external Armada-detected causes (stuck-starting-up, unschedulable, and optionally the structural ones) into their own categories without regexing Armada's messages, plus running the categorizer on the lease-return and submission paths.